{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text retrieval\n",
    "\n",
    "This guide will introduce techniques for organizing text data. It will show how to analyze a large corpus of text, extracting feature vectors for individual documents, in order to be able to retrieve documents with similar content.\n",
    "\n",
    "[scipy](https://www.scipy.org/) and [scikit-learn](scikit-learn.org) are required to run through this document, as well as a corpus of text documents. This code can be adapted to work with a set of documents you collect. For the purpose of this example, we will use the well-known Reuters-21578 dataset with 90 categories. To download this dataset, download it manually from [here](http://disi.unitn.it/moschitti/corpora.htm), or run `download.sh` in the `data` folder (to get all the other data for ml4a-guides as well), or just run:\n",
    "\n",
    "    wget http://disi.unitn.it/moschitti/corpora/Reuters21578-Apte-90Cat.tar.gz\n",
    "    tar -xzf Reuters21578-Apte-90Cat.tar.gz\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you've downloaded and unzipped the dataset, take a look inside the folder. It is split into two folders, \"training\" and \"test\". Each of those contains 91 subfolders, corresponding to pre-labeled categories, which will be useful for us later when we want to try classifying the category of an unknown message. In this notebook, we are not worried about training a classifier, so we'll end up using both sets together.\n",
    "\n",
    "Let's note the location of the folder into a variable `data_dir`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data_dir = '../data/Reuters21578-Apte-90Cat'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's open up a single message and look at the contents. This is the very first message in the training folder, inside of the \"acq\" folder, which is a category apparently containing news of corporate acquisitions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\r\n",
      "COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE\r\n",
      "\n",
      "     COMMACK, N.Y., Feb 26 - Computer Terminal Systems Inc said\r\n",
      "it has completed the sale of 200,000 shares of its common\r\n",
      "stock, and warrants to acquire an additional one mln shares, to\r\n",
      "<Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs.\r\n",
      "    The company said the warrants are exercisable for five\r\n",
      "years at a purchase price of .125 dlrs per share.\r\n",
      "    Computer Terminal said Sedio also has the right to buy\r\n",
      "additional shares and increase its total holdings up to 40 pct\r\n",
      "of the Computer Terminal's outstanding common stock under\r\n",
      "certain circumstances involving change of control at the\r\n",
      "company.\r\n",
      "    The company said if the conditions occur the warrants would\r\n",
      "be exercisable at a price equal to 75 pct of its common stock's\r\n",
      "market price at the time, not to exceed 1.50 dlrs per share.\r\n",
      "    Computer Terminal also said it sold the technolgy rights to\r\n",
      "its Dot Matrix impact technology, including any future\r\n",
      "improvements, to <Woodco Inc> of Houston, Tex. for 200,000\r\n",
      "dlrs. But, it said it would continue to be the exclusive\r\n",
      "worldwide licensee of the technology for Woodco.\r\n",
      "    The company said the moves were part of its reorganization\r\n",
      "plan and would help pay current operation costs and ensure\r\n",
      "product delivery.\r\n",
      "    Computer Terminal makes computer generated labels, forms,\r\n",
      "tags and ticket printers and terminals.\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "post_path = os.path.join(data_dir, \"training\", \"acq\", \"0000005\")\n",
    "with open (post_path, \"r\") as p:\n",
    "    raw_text = p.read()\n",
    "    print(raw_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our collection contains over 15,000 articles with a lot of information. It would take way too long to get through all the information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['silver', 'cocoa', 'crude', 'coconut', 'jet', 'retail', 'coconut-oil', 'coffee', 'earn', 'copper', 'ship', 'castor-oil', 'cpi', 'sugar', 'dlr', 'hog', 'iron-steel', 'propane', 'veg-oil', 'rapeseed', 'platinum', 'dfl', 'rape-oil', 'rice', 'palmkernel', 'housing', 'zinc', 'jobs', 'tea', 'l-cattle', 'unknown', 'alum', 'lead', 'interest', 'money-supply', 'cotton', 'soybean', 'rubber', 'corn', 'soy-meal', 'orange', 'barley', 'soy-oil', 'heat', 'lei', 'pet-chem', 'gas', 'fuel', 'ipi', 'palladium', 'grain', 'oat', 'nkr', 'yen', 'groundnut-oil', 'naphtha', 'oilseed', 'rye', 'groundnut', 'wheat', 'cpu', 'gold', 'sun-meal', 'nat-gas', 'reserves', 'instal-debt', 'dmk', 'strategic-metal', 'cotton-oil', 'bop', 'wpi', 'nickel', 'tin', 'income', 'trade', 'rand', 'livestock', 'gnp', 'lumber', 'sun-oil', 'palm-oil', 'nzdlr', 'acq', 'carcass', 'copra-cake', 'potato', 'sunseed', 'sorghum', 'meal-feed', 'money-fx', 'lin-oil']\n"
     ]
    }
   ],
   "source": [
    "# this gives us all the groups (from training subfolder, but same for test)\n",
    "groups = [g for g in os.listdir(os.path.join(data_dir, \"training\")) if os.path.isdir(os.path.join(data_dir, \"training\", g))]\n",
    "print groups"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's load all of our documents (news articles) into a single list called `docs`. We'll iterate through each group, grab all of the posts in each group (from both training and test directories), and add the text of the post into the `docs` list. We will make sure to exclude duplicate posts by cheking if we've seen the post index before."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "reading group 1 / 91\n",
      "reading group 11 / 91\n",
      "reading group 21 / 91\n",
      "reading group 31 / 91\n",
      "reading group 41 / 91\n",
      "reading group 51 / 91\n",
      "reading group 61 / 91\n",
      "reading group 71 / 91\n",
      "reading group 81 / 91\n",
      "reading group 91 / 91\n",
      "\n",
      "we have 12897 documents in 91 groups\n",
      "\n",
      "here is document 100:\n",
      "\n",
      "\n",
      "DUTCH COCOA PROCESSORS UNHAPPY WITH ICCO BUFFER\n",
      "\n",
      "<AUTHOR>    By Jeremy Lovell, Reuters</AUTHOR>\n",
      "    ROTTERDAM, June 1 - Dutch cocoa processors are unhappy with\n",
      "the intermittent buying activities of the International Cocoa\n",
      "Organization's buffer stock manager, industry sources told\n",
      "Reuters.\n",
      "    \"The way he is operating at the moment is doing almost\n",
      "nothing to support the market. In fact he could be said to be\n",
      "actively depressing it,\" one company spokesman said.\n",
      "    Including the 3,000 tonnes he acquired on Friday, the total\n",
      "amount of cocoa bought by the buffer stock manager since he\n",
      "recently began support operations totals 21,000 tonnes.\n",
      "    Despite this buying, the price of cocoa is well under the\n",
      "1,600 Special Drawing Rights, SDRs, a tonne level below which\n",
      "the bsm is obliged to buy cocoa off the market.\n",
      "    \"Even before he started operations, traders estimated the\n",
      "manager would need to buy at least up to his 75,000 tonnes\n",
      "maximum before prices moved up to or above the 1,600 SDR level,\n",
      "and yet he appears reluctant to do so,\" one manufacturer said.\n",
      "    \"We all hoped the manager would move into the market to buy\n",
      "up to 75,000 tonnes in a fairly short period, and then simply\n",
      "step back,\" he added.\n",
      "    \"The way the manager is only nibbling at the edge of the\n",
      "market at the moment is actually depressing sentiment and the\n",
      "market because everyone is holding back from both buying and\n",
      "selling waiting to see what the manager will do next,\" one\n",
      "processor said.\n",
      "    \"As long as his buying tactics remain the same the market is\n",
      "likely to stay in the doldrums, and I see no indication he is\n",
      "about to alter his methods,\" he added.\n",
      "    Processors and chocolate manufacturers said consumer prices\n",
      "for cocoa products were unlikely to be affected by buffer stock\n",
      "buying for some time to come.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "docs = []\n",
    "post_idx = []\n",
    "for g, group in enumerate(groups):\n",
    "    if g%10==0:\n",
    "        print (\"reading group %d / %d\"%(g+1, len(groups)))\n",
    "    posts_training = [os.path.join(data_dir, \"training\", group, p) for p in os.listdir(os.path.join(data_dir, \"training\", group)) if os.path.isfile(os.path.join(data_dir, \"training\", group, p))]\n",
    "    posts_test = [os.path.join(data_dir, \"test\", group, p) for p in os.listdir(os.path.join(data_dir, \"test\", group)) if os.path.isfile(os.path.join(data_dir, \"test\", group, p))]\n",
    "    posts = posts_training + posts_test\n",
    "    for post in posts:\n",
    "        idx = post.split(\"/\")[-1]\n",
    "        if idx not in post_idx:\n",
    "            post_idx.append(idx)\n",
    "            with open(post, \"r\") as p:\n",
    "                raw_text = p.read()\n",
    "                raw_text = re.sub(r'[^\\x00-\\x7f]',r'', raw_text) \n",
    "                docs.append(raw_text)\n",
    "\n",
    "print(\"\\nwe have %d documents in %d groups\"%(len(docs), len(groups)))\n",
    "print(\"\\nhere is document 100:\\n%s\"%docs[100])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now use `sklearn`'s `TfidfVectorizer` to compute the tf-idf matrix of our collection of documents. The tf-idf matrix is an `n`x`m` matrix with the `n` rows corresponding to our `n` documents and the `m` columns corresponding to our terms. The values corresponds to the \"importance\" of each term to each document, where importance is *. In this case, terms are just all the unique words in the corpus, minus english _stopwords_, which are the most common words in the english language, e.g. \"it\", \"they\", \"and\", \"a\", etc. In some cases, terms can be n-grams (n-length sequences of words) or more complex, but usually just words.\n",
    "\n",
    "To compute our tf-idf matrix, run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<12897x33972 sparse matrix of type '<type 'numpy.float64'>'\n",
       "\twith 740808 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer(stop_words='english')\n",
    "tfidf = vectorizer.fit_transform(docs)\n",
    "\n",
    "tfidf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the variable `tfidf` is a sparse matrix with a row for each document, and a column for each unique term in the corpus. \n",
    "\n",
    "Thus, we can interpret each row of this matrix as a feature vector which describes a document. Two documents which have identical rows have the same collection of words in them, although not necessarily in the same order; word order is not preserved in the tf-idf matrix. Regardless, it seems reasonable to expect that if two documents have similar or close tf-idf vectors, they probably have similar content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\r\n",
      "PEGASUS GOLD <PGULF> STARTS MILLING IN MONTANA\r\n",
      "\n",
      "    JEFFERSON CITY, Mont., March 27 - Pegasus Gold Inc said\r\n",
      "milling operations have started at its Montana Tunnels open-pit\r\n",
      "gold, silver, zinc and lead mine near Helena.\r\n",
      "    The start-up is three months ahead of schedule and six mln\r\n",
      "dlrs under budget, the company said. Original capital cost of\r\n",
      "the mine was 57.5 mln dlrs, but came in at 51.5 mln dlrs, the\r\n",
      "company said.\r\n",
      "    After a start-up period, the mill is expected to produce \r\n",
      "106,000 ounces of gold, 1,700,000 ounces of silver, 26,000 tons\r\n",
      "of zinc and 5,700 tons of lead on an annual basis from\r\n",
      "4,300,000 tons of ore, the company said.\r\n",
      "\r\n",
      "document's term-frequency pairs:\n",
      "\"march\"=0.03, \"said\"=0.09, \"gold\"=0.34, \"silver\"=0.21, \"lead\"=0.13, \"zinc\"=0.22, \"ore\"=0.10, \"000\"=0.16, \"tons\"=0.30, \"ounces\"=0.21, \"dlrs\"=0.09, \"300\"=0.07, \"mln\"=0.09, \"expected\"=0.06, \"company\"=0.12, \"capital\"=0.06, \"near\"=0.08, \"milling\"=0.25, \"pegasus\"=0.28, \"pgulf\"=0.15, \"starts\"=0.11, \"montana\"=0.27, \"jefferson\"=0.15, \"city\"=0.08, \"mont\"=0.13, \"27\"=0.06, \"operations\"=0.07, \"started\"=0.09, \"tunnels\"=0.16, \"open\"=0.08, \"pit\"=0.14, \"helena\"=0.15, \"start\"=0.16, \"months\"=0.06, \"ahead\"=0.09, \"schedule\"=0.11, \"budget\"=0.08, \"original\"=0.10, \"cost\"=0.08, \"57\"=0.09, \"came\"=0.09, \"51\"=0.08, \"period\"=0.07, \"produce\"=0.09, \"106\"=0.10, \"700\"=0.17, \"26\"=0.06, \"annual\"=0.07, \"basis\"=0.07\n"
     ]
    }
   ],
   "source": [
    "doc_idx = 5\n",
    "\n",
    "doc_tfidf = tfidf.getrow(doc_idx)\n",
    "all_terms = vectorizer.get_feature_names()\n",
    "terms = [all_terms[i] for i in doc_tfidf.indices]\n",
    "values = doc_tfidf.data\n",
    "\n",
    "print(docs[doc_idx])\n",
    "print(\"document's term-frequency pairs:\")\n",
    "print(\", \".join(\"\\\"%s\\\"=%0.2f\"%(t,v) for t,v in zip(terms,values)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice however, the term-document matrix alone has several disadvantages. For one, it is very high-dimensional and sparse (mostly zeroes), thus it is computationally costly. \n",
    "\n",
    "Additionally, it ignores similarity among groups of terms. For example, the words \"seat\" and \"chair\" are related, but in a raw term-document matrix they are separate columns. So two sentences with one of each word will not be computed as similarly.\n",
    "\n",
    "One solution is to use latent semantic analysis (LSA, or sometimes called latent semantic indexing). LSA is a dimensionality reduction technique closely related to principal component analysis, which is commonly used to reduce a high-dimensional set of terms into a lower-dimensional set of \"concepts\" or components which are linear combinations of the terms.\n",
    "\n",
    "To do so, we use `sklearn`'s `TruncatedSVD` function which gives us the LSA by computing a [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) of the tf-idf matrix. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import TruncatedSVD\n",
    "\n",
    "lsa = TruncatedSVD(n_components=100)\n",
    "tfidf_lsa = lsa.fit_transform(tfidf)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How to interpret this? `lsa` holds our latent semantic analysis, expressing our 100 concepts. It has a vector for each concept, which holds the weight of each term to that concept. `tfidf_lsa` is our transformed document matrix where each document is a weighted sum of the concepts. \n",
    "\n",
    "In a simpler analysis with, for example, two topics (sports and tacos), one concept might assign high weights for sports-related terms (ball, score, tournament) and the other one might have high weights for taco-related concepts (cheese, tomato, lettuce). In a more complex one like this one, the concepts may not be as interpretable. Nevertheless, we can investigate the weights for each concept, and look at the top-weighted ones. For example, here are the top terms in concept 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10 highest-weighted terms in concept 1:\n",
      " - vs : 32799.00\n",
      " - 000 : 1.00\n",
      " - net : 21096.00\n",
      " - loss : 18710.00\n",
      " - cts : 8738.00\n",
      " - revs : 26257.00\n",
      " - shr : 27999.00\n",
      " - profit : 24257.00\n",
      " - avg : 4002.00\n",
      " - shrs : 28014.00\n"
     ]
    }
   ],
   "source": [
    "components = lsa.components_[1]\n",
    "all_terms = vectorizer.get_feature_names()\n",
    "\n",
    "idx_top_terms = sorted(range(len(components)), key=lambda k: components[k])\n",
    "\n",
    "print(\"10 highest-weighted terms in concept 1:\")\n",
    "for t in idx_top_terms[:10]:\n",
    "    print(\" - %s : %0.02f\"%(all_terms[t], t))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The top terms in concept 1 appear related to accounting balance sheets; terms like \"net\", \"loss\", \"profit\".\n",
    "\n",
    "Now, back to our documents. Recall that `tfidf_lsa` is a transformation of our original tf-idf matrix from the term-space into a concept-space. The concept space is much more valuable, and we can use it to query most similar documents. We expect that two documents which about similar things should have similar vectors in `tfidf_lsa`. We can use a simple distance metric to measure the similarity, euclidean distance or cosine similarity being the two most common. \n",
    "\n",
    "Here, we'll select a single query document (index 300), calculate the distance of every other document to our query document, and take the one with the smallest distance to the query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "QUERY DOCUMENT:\n",
      " \n",
      "\r\n",
      "U.S ENERGY SECRETARY PROPOSES OIL TAX INCENTIVES\r\n",
      "\n",
      "    WASHINGTON, March 18 - Energy Secretary John Herrington\r\n",
      "said he will propose tax incentives to increase domestic oil\r\n",
      "and natural gas exploration and production to the Reagan\r\n",
      "Administration for consideration.\r\n",
      "    \"These options boost production, while avoiding the huge\r\n",
      "costs associated with proposals like an oil import fee,\"\r\n",
      "Herrington told a House Energy subcommittee hearing. \"It is my\r\n",
      "intention to submit these proposals to the domestic policy\r\n",
      "council and the cabinet for consideration and review.\"\r\n",
      "    He said proposals, including an increase in the oil\r\n",
      "depletion allowance and repeal of the windfall profits tax,\r\n",
      "should be revenue neutral and promote domestic production at\r\n",
      "the least cost to the economy and the taxpayers.\r\n",
      "    \"The goal of the Administration policies is to increase\r\n",
      "domestic production. I would like to shoot for one mln barrels\r\n",
      "a year.\"\r\n",
      "    The proposals were based on a DOE study released yesterday\r\n",
      "warning the United States was threatened by a growing\r\n",
      "dependence on oil imports.\r\n",
      "    \"We project free world dependence on Persian Gulf oil at 65\r\n",
      "pct by 1995,\" Herrington said.\r\n",
      "    He said it was too soon to say what the Administration\r\n",
      "policy on oil tax incentives would be and indicated there would\r\n",
      "be opposition to tax changes.\r\n",
      "    \"Of course, to move forward with these kinds of options\r\n",
      "would require reopening tax issues settled last year (in the\r\n",
      "tax reform bill) -- an approach which has not, in general, been\r\n",
      "favored by the administration. I think what we need is to\r\n",
      "debate this within the Administration,\" he said.\r\n",
      "    He said the proposals might raise gasoline prices.\r\n",
      "    Herrington did not specifically confirm a report in today's\r\n",
      "Washington Post that he had written to President Reagan urging\r\n",
      "an increase in the oil depletion allowance.\r\n",
      "    Asked about the report by subcommittee members, Herrington\r\n",
      "said various proposals were under consideration and would be\r\n",
      "debated within the Administration to determine which would have\r\n",
      "the most benefits at the least cost.\r\n",
      "\r",
      " \n",
      "MOST SIMILAR DOCUMENT TO QUERY:\n",
      " \n",
      "\r\n",
      "DOE SECRETARY PROPOSES OIL TAX INCENTIVES\r\n",
      "\n",
      "    WASHINGTON, March 18 - Energy Secretary John Herrington\r\n",
      "said he will propose tax incentives to increase domestic oil\r\n",
      "and natural gas exploration and production to the Reagan\r\n",
      "Administration for consideration.\r\n",
      "    \"These options boost production, while avoiding the huge\r\n",
      "costs associated with proposals like an oil import fee,\"\r\n",
      "Herrington told a House Energy subcommittee hearing. \"It is my\r\n",
      "intention to submit these proposals to the domestic policy\r\n",
      "council and the cabinet for consideration and review.\"\r\n",
      "    \"The goal of the Administration policies is to increase\r\n",
      "domestic production. I would like to shoot for one mln barrels\r\n",
      "a day,\" he said.\r\n",
      "    The proposals were based on a DOE study released yesterday\r\n",
      "warning the United States was threatened by a growing\r\n",
      "dependence on oil imports.\r\n",
      "    \"We project free world dependence on Persian Gulf oil at 65\r\n",
      "pct by 1995,\" Herrington said.\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "from scipy.spatial import distance\n",
    "\n",
    "query_idx = 400\n",
    "\n",
    "# take the concept representation of our query document\n",
    "query_features = tfidf_lsa[query_idx]\n",
    "\n",
    "# calculate the distance between query and every other document\n",
    "distances = [ distance.euclidean(query_features, feat) for feat in tfidf_lsa ]\n",
    "    \n",
    "# sort indices by distances, excluding the first one which is distance from query to itself (0)\n",
    "idx_closest = sorted(range(len(distances)), key=lambda k: distances[k])[1:]\n",
    "\n",
    "# print our results\n",
    "query_doc = docs[query_idx]\n",
    "return_doc = docs[idx_closest[0]]\n",
    "print(\"QUERY DOCUMENT:\\n %s \\nMOST SIMILAR DOCUMENT TO QUERY:\\n %s\" %(query_doc, return_doc))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interesting find! Our query document appears to be about tax incentives for domestic oil and natural gas. Our return document is a related article about the same topic. Try looking at the next few closest results. A quick inspection reveals that most of them are about the same story.\n",
    "\n",
    "Thus we see the value of this procedure. It gives us a way to quickly identify articles which are related to each other. This can greatly aide journalists who have to sift through a lot of content which is not always indexed or organized usefully. \n",
    "\n",
    "More creatively, we can think of other ways this can be made useful. For example, what if instead of making our documents the articles themselves, what if they were made to be paragraphs from the articles? Then, we could potentially discover relevant paragraphs about one topic which are buried in an article which is otherwise about a different topic. We can combine this with handcrafted filters as well (date range, presence of a word or name, etc); perhaps you want to quickly find every quote politican X has made about topic Y. This provides an effective means to do so."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
